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A central biological question is how natural organisms 
are so evolvable (capable of quickly adapting to new envi- 
ronments). A key driver of evolvability is the widespread 
modularity of biological networks-their organization as func- 
tional, sparsely connected subunits-but there is no consen- 
sus regarding why modularity itself evolved. While most 
hypotheses assume indirect selection for evolvability, here 
^ we demonstrate that the ubiquitous, direct selection pres- 
^ sure to reduce the cost of connections between network 
^ nodes causes the emergence of modular networks. Exper- 
^ iments with selection pressures to maximize network perfor- 
mance and minimize connection costs yield networks that 
^ are significantly more modular and more evolvable than con- 
^ trol experiments that only select for performance. These re- 
^ suits will catalyze research in numerous disciplines, includ- 
y—^ ing neuroscience, genetics and harnessing evolution for en- 
gineering purposes. 

A Long-standing open question in biology is how populations 
^; r\ are capable of rapidly adapting to novel environments, a 
O trait called evolvability A major contributor to evolvability is 
* ^ the fact that many biological entities are modular, especially the 
I many biological processes and structures that can be modeled as 
^ 'networks, such as brains, metabolic pathways, gene regulation 
and protein interactions | 3Fji9j36j42| jsjgtworks are modular if they 
contain highly connected cluster s of nod es that are sparsely con- 
»^ nected to nodes in other Despite its importance 

^T) and decad es of re search, there is no agreement on why modular- 
ity evolvesE^lSE3 Intuitively, modular systems seem more adapt- 
able, a lesson well-known to human engineer s^°, because it is eas- 
ier to rewire a modular netwo rk wi th functional subunits than an 
entangled, monolithic networl@^. However, because this evolv- 
ability only provides a selective advantage over the long-term, 
(N| such selection is at best indirect and may not be strong enough to 
T-H explain the level of modularity in the natural world^^^. 

Modularity is likely caused by multiple forces acting to var- 
. ^ ious degrees in different context^, and a comprehensive un- 
derstanding of the evolutionary origins of modularity involves 
^ identifying those multiple forces and their relative contributions. 
^ The leading hypothesis is that modularity mainly emerges due 
to rapidly changing environments that have common subprob- 
lems, but different overall problems^^^"^. Computational sim- 
ulations demonstrate that in such environments (called modu- 
larly varying goals: MVG), networks evolve both modularity 
and evolvabilit}E3^. In contrast, evolution in unchanging en- 
vironments produces non- modu lar networks that are slower to 
adapt to new environment&^323 Follow-up studies support the 
modularity-generating force of MVG in nature: the modular- 
ity of bacterial metabolic networks is correlated with the fre- 
quency with which their environments change^^. It is unknown 
how much natural modularity MVG can explain, however, be- 
cause it unclear how many biological environments change mod- 
ularly, and whether they change at a high enough frequency for 
this force to play a significant role^^. A related theory that also 
assumes a constantly changing environment and selection for 
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Fig. 1. Main hypothesis. Evolving networks with selection for perfor- 
mance alone produces non-modular networks that are slow to adapt to 
new environments. Adding a selective pressure to minimize connection 
costs leads to the evolution of modular networks that quickly adapt to new 
environments. 



evolvability is that modularity arises to enable modifying one 
subcomponent without affecting others^^. There are other plausi- 
ble hypotheses (reviewed in Wagner 2007), including that vari- 
ation mechanisms, such as gene duplication, create a bias to- 
wards the generation of modular structures'^ and that modularity 
evolves due to selection to make phenotypes robust to environ- 
mental perturbations'^. 

We investigate an alternate hypothesis that has been suggested, 
but heretofore untested, which is that modularity evolves not be- 
cause it conveys evolvability, but as a byproduct from selection to 
reduce connection costs in a network (Fig.jlJ^^ Such costs in- 
clude manufacturing connections, maintaining them, the energy 
to transmit along them, and signal delays, all of whi ch increase as 
a function of connection length and numbeiEEEEH. The concept 
of connection costs is straightforward in networks with physi- 
cal connections (e.g. neural), but may also apply to other types 
of networks like metabolic and genetic pathways, as illustrated 
by the following examples; adding links in a pathway may in- 
cur delays; adding binding sites may slow replication and phys- 
ically make regulation inefficient because polymerases recruited 
to upstream promoters would have to traverse longer stretches 
of sequence to reach protein-coding gene sequences; adding pro- 
tein and metabolic interactions may increase constraints. The 
strongest evidence that biological networks face direct selection 
to minimize connection costs comes from the vascular system 
and from nervous systems, including the brain, where multiple 
studies suggest that the summed length of the wiring diagram 
has been minimized, either by reducing lon^ connections or by 
optimizing the placement of neurons^ ^ Founding^^ and 

modern neuroscientists have hypothesized that direct selection 
to minimize connection costs may, as a side-effect, cause modu- 
larity. This hypothesis has never been tested in the context of evo- 
lutionary biology. The most related study was on non-evolving, 
simulated neural networks with a specific within-life learning al- 
gorithm that produced more modularity when minimizing con- 
nection length in addition to performance, although the gen- 
erality of the result was questioned when it was not replicated 
with other learning algorithms^. Without during-life learning al- 
gorithms, carefully-constructed MVG environments, or mutation 
operators strongly biased toward creating modules, attempts to 
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Fig. 2. The addition of connection costs leads to higher-performing, functionally modular networks. (A) Networks evolve to recognize patterns 
(objects) in an eight-pixel retina. The problem is modularly decomposable because whether an object exists on the left and right sides can be 
separately determined before combining that information to answer whether objects exist on both sides (denoted by the AND logic function). (B) 
Networks from an example trial become more modular across evolutionary time (see SI for video) with a pressure to minimize connection costs 
in addition to performance (P&CC). (C) Median performance (± 95% bootstrapped confidence intervals) per generation of the highest-performing 
network of each trial, which is perfect only when minimizing connection costs in addition to performance. (D) Network modularity, which is significantly 
higher in P&CC trials than when selecting for performance alone (PA). (E) The 12 highest-performing PA networks, each from a separate trial. (F) The 
12 highest-performing P&CC networks, which are functionally modular in that they have separate modules for the left and right subproblems. Nodes 
are colored according to membership in separate partitions when making the most modular split of the network (see text). The final networks of all 50 
trials are visualized in Fig. ^ (G,H) The cost and modularity of PA and P&CC populations across generations, pooled from all 50 trials. A connection 
cost pushes populations out of high-cost, low-modularity regions towards low-cost, modular regions. Fig.[3]shows the fitness potential of each map 
area. 



evolve modularity in neural networks have 

Given the impracticality of observing modularity evolve in 
biological systems, we follow most research on the subject by 
conducting experiments in computational systems with evolu- 
tionary dynamics^^^^^^. Specifica lly, we use a well-studied sys- 
tem from the MVG investigation£32EH evolving networks to 
solve pattern-recognition tasks and Boolean logic tasks (Meth- 



ods). These networks have inputs that sense the environment 
and produce outputs (e.g. activating genes, muscle commands, 
etc.), which determine performance on environmental problems. 
We compare a treatment where the fitness of networks is based 
on performance alone (PA) to one based on two objectives: maxi- 
mizing performance and minimizing connection costs (P&CC). A 
multi-objective evolutionary algorithm is used with one (PA) 
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Fig. 3. The highest-performing networks found for each combina- 
tion of modularity and cost for the retina problem. Colors indicate the 
highest-performing network found at that point in the modularity vs. cost 
space, with yellow representing perfect performance. This map has been 
generated using the IVIOLE algorithm (IVIethods). The best-performing 
network at the end of each of the 50 PA and P&CC runs are overlaid on 
the map. Networks with perfect performance exist throughout the space, 
which helps explain why modularity does not evolve when there is selec- 
tion based on performance alone. Below a cost threshold of around 125 
there is an inverse correlation between cost and modularity for perfectly 
performing networks. The lowest cost networks-those with the shortest 
summed lengths-that are high-performing are modular. 



or two (P&CC) objectives: To reflect that selection is stronger 
on network performance than connection costs, the P&CC cost 
objective affects selection probabilistically only 25% of the time, 
although the results are robust to substantial changes to this 
value (Methods). Two example connection cost functions are in- 
vestigated. The default one is the summed squared length of 
all connections, assuming nodes are optimally located to mini- 
mize th is sum (Methods), as has been found for animal nervous 
systemJ33i3. A second measure of costs as solely the number of 
connections yields qualitatively similar results to the default cost 
function, and may better represent biological networks without 
connections of different lengths. More fit networks tend to have 
more offspring (copies that are randomly mutated), and the cycle 
repeats for a preset number of generations (Fig.jl] Methods). Such 
computational evolving systems have substantially improved 
our understanding of natural evolutionary dynamics^^ 

The main experimental problem involves a network that re- 
ceives stimuli from eight inputs^^. It can be thought of as an eight- 
pixel retina receiving visual stimuli, although other analogies are 
valid (Methods), such as a genetic regulatory network exposed 
to environmental stimuli. Patterns shown on the retina's left and 
right halves may each contain an 'object', meaning a pattern of 
interest (Fig. [2^). Networks evolve to answer whether an object is 
present on both the left and right sides of the retina (the L-AND-R 
environment) or whether an object is displayed on either side (the 
L-OR-R environment). Which patterns count as an object on the 
left and right halves are slightly different (Fig. 2a). 

Each network iteratively sees all possible 256 input patterns 
and answers true (> ) or false (< 0). Its performance is the 
percentage of correct answers, which depends on which neurons 
are connected, how strongly, and whether those connections are 
inhibitory or excitatory (Methods). Networks are randomly gen- 
erated to start each experiment. Their connections stochastically 
mutate during replication (Methods). Network modularity is 
evaluated with an efficient approximation^^'^^' of the widely used 
modularity metric Q, which first optimally divides networks into 
modules then measures the difference between the number of 
edges within each module and the num ber ex pected for random 
networks with the same number of edgeOT 
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Fig. 4. Evolving with connection costs produces networks that are 
more evolvable. (A) P&CC networks adapt faster to new environments 
than PA networks. Organisms were first evolved in one environment (e.g. 
L-AND-R) until they reached perfect performance and then transferred 
to a second environment (e.g. L-OR-R). Thick lines are medians, boxes 
extend from 25*^ to 75*^ data percentiles, thin lines mark 1.5 x IQR (in- 
terquartile range), and plus signs represent outliers. Fig. ^is a zoomed- 
out version showing all of the data. (B) P&CC networks in an unchanging 
environment (dotted green line) have similar levels of modularity to the 
highest levels produced by MVG (solid blue line). Combining IVIVG with 
P&CC results in even higher modularity levels (solid green line), showing 
that the forces combined are stronger than either alone. 



Results 

After 25000 generations in an unchanging environment (L-AND- 
R), treatments selected to maximize performance and minimize 
connection costs (P&CC) produce significantly more modular 
networks than treatments maximizing performance alone (PA) 
(Fig.|2]i, Q = 0.42,95% confidence interval [0.25,0.45] vs. Q = 
0.18[0.16,0.19], p = 8 X 10"°^ using Matlab's Mann-Whitney- 
Wilcoxon rank sum test, which is the default statistical test unless 
otherwise specified). To test whether evolved networks exhibit 
functional modularity corresponding to the left-right decomposi- 
tion of the task we divide networks into two modules by selecting 
the division that maximizes Q and color nodes in each partition 
differently Left-right decomposition is visually apparent in most 
P&CC trials and absent in PA trials (Fig.[2^,f). Functional modu- 
larity can be quantified by identifying whether left and right in- 
puts are in different partitions, which occurrs in 56% of P&CC 
trials and never with PA (Fisher's exact test, p = 4xl0~^^). Pairs 
of perfect sub-solution neurons-whose outputs perfectly answer 
the left and right subproblems-occur in 39% of P&CC trials and 
0% of PA trials (Fisher's exact test, p = 3 x 10"^ Fig. 

Despite the additional constraint, P&CC networks are 
significantly higher-performing than PA networks (Fig. |2j:). 
The median-performing P&CC network performs per- 
fectly (1.0 [1.0, 1.0]), but the median PA network does not 
(0.98[0.97,0.98] , p = 2 x 10"°^). P&CC performance may 
be higher because its networks have fewer nodes and con- 
nections (Fig. ^9]i,e), meaning fewer parameters to optimize. 
Modular structures are also easier to adapt since mutational 
effects are smaller, being confined to subcomponents'^^. While 
it is thought that optimal, non-modular solutions usually out- 
perform optimal, modular designs, such 'modularity overhead' 
only exists when comparing optimal designs, and is not at odds 
with the finding that adaptation can be faster and ultimately more 
successful with a bias towards modular solutions'^^. 

To better understand why the presence of a connection cost in- 
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'^^ L-OR-R^L-AND-R: 12.0[7.0, 21.0] vs. 
9 X 10~^^°). Modular networks thus 



creases performance and modularity, we searched for the highest- 
performing networks at all possible combinations of modularity 
and cost (Methods). For high-performing networks, there is an 
inverse correlation between cost and modularity, such that the 
lowest-cost networks are highly modular (Fig.|3|. Many runs in 
the P&CC treatment evolved networks in this region whereas the 
PA treatments never did. There are also many non-modular, high- 
cost networks that are high-performing, helping to explain why 
modularity does not evolve due to performance alone (Fig. |3j. 
Comparing PA vs. P&CC populations across generations reveals 
that a connection cost pushes populations out of high-cost, low- 
modularity areas of the search space into low-cost, modular ar- 
eas (Figs. |2^,h). Without the pressure to leave high-cost, low- 
modularity regions, many PA networks remain in areas that ul- 
timately do not contain the highest-performing solutions (Fig. |3] 
dark green squares in the bottom right), further explaining why 
P&CC treatments have higher performance. 

P&CC networks are also more evolvable than PA networks. 
We ran additional trials until 50 P&CC and 50 PA trials each 
had a perfectly performing network (Methods) and transferred 
these networks into the L-OR-R environment, which has the same 
subproblems in a different combination (Fig. The presence 
(P&CC) or absence (PA) of a connection cost remained after the 
environmental change. We also repeated the experiment, ex- 
cept first evolving in L-OR-R and then transferring to L-AND- 
R. In both experiments, P&CC networks exhibit greater evolv- 
ability than PA by requiring fewer generations to adapt to the 
new environment (Fig. [3^, L-AND-R^ L-OR-R: 3.0[2. 0,5.0] vs. 
65[62,69],p = 3 X 10 
222. 5[175.0, 290.0], p 
evolve because their sparse connectivity has lower connection 
costs, but such modularity also aids performance and evolvabil- 
ity because the problem is modular. 

Minimizing connection costs can work in conjunction with 
other forces to increase modularity. Modularity levels are higher 
when combining P&CC with MVG environments (Fig.jiJ^: solid 
vs. dotted green line, p = 3 x 10~^). Overall, P&CC (with or 
without MVG) yields similar levels of modularity as MVG at its 
strongest, and significantly more when rates of environmental 
change are too slow for the MVG effect to be strong (Fig.|4] green 
lines vs. blue solid line). 

P&CC modularity is also higher than PA even on problems that 
are non-modular (Fig. |5^, p = 5.4 x 10~^^). As to be expected, 
such modularity is lower than on modular problems (p — 0.0011, 
modular retina vs. non-modular retina). This non-modular prob- 
lem involves answering whether any four pixels were on (black), 
which is non-modular because it requires information from all 
retina inputs. As mentioned previously, performance and mod- 
ularity are also significantly higher with an alternate connection 
cost function based on the number of connections (P&CC-NC) in- 
stead of the length of connections (Fig.jSJ. We also verified that 
modularity and performance are not higher simply because a sec- 
ond objective is used (Fig.|6j. We further tested whether modu- 
larity arises even when the inputs for different modules are not 
geometrically separated, which is relevant when cost is a func- 
tion of connection length. Even in experiments with randomized 
input coordinates (Methods), a connection cost significantly in- 
creased performance ( 1.0 [0.98, 1.0], p = 0.0012) and modularity 
(Q = 0.35[0.34, 0.38], p = 1 x 10"^), and performance and mod- 
ularity scores were not significantly different than P&CC without 
randomized coordinates (Fig. 

All the results presented so far are qualitatively similar in a dif- 
ferent model system: evolving networks to solve Boolean logic 
tasks. We tested two fully separable problems: one with five ''ex- 
clusive or'' (XOR) logic modules (Fig.lsb), and another with hi- 
erarchically nested XOR problems (FigTst). P&CC created sepa- 
rate modules for the decomposed problems in nearly every trial, 
whereas PA almost never did (Figs. ^ P&CC Performance 
was also significantly higher (Fig. [5]3,c), and there was an an in- 
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Fig. 6. Alternate Cost Functions Performance and modularity are signif- 
icantly higher (p < 0.0001) either with a cost function based on the length 
(P&CC) or number (P&CC-NC) of connections versus performance alone 
(PA) or performance and a random objective (P&RO). P&RO assigned 
a random number to each organism instead of a connection cost score 
and maximized that random number. Fig. ^contains visualizations of all 
P&CC-NC networks. 



verse correlation between cost and modularity (Figs. ^10) . Con- 
firming the generality of the finding that connection costs im- 
prove adaptation rates and that high-performing, low-cost net- 
works are modular is an interesting area for future research. 

Discussion and Conclusion 

Overall, this paper supports the hypothesis that selection to re- 
duce connection costs causes modularity, even in unchanging en- 
vironments. The results also open new areas of research into 
identifying connection costs in networks without physical con- 
nections (e.g. genetic regulatory networks) and investigating 
whether pressures to minimize connection costs may explain 
modularity in human-created networks (e.g. communication and 
social networks). 

It is tempting to consider any component of modularity that 
arises due to minimizing connection costs as a 'spandrel', in that it 
emerges as a byproduct of selection for another trait^^. However, 
because the resultant modularity produces evolvability minimiz- 
ing connection costs may serve as a bootstrapping process that 
creates initial modularity that can then be further elevated by 
selection for evolvability. Such hypotheses for how modularity 
initially arises are needed, because selection for evolvability can- 
not act until enough modularity exists to increase the speed of 
adaptation^^. 

Knowing that selection to reduce connection costs produces 
modular networks will substantially advance fields that har- 
ness evolution for engineering, because a longstanding challenge 
therein has been evolving modular design^i^^^^J^iJ^. It will ad- 
ditionally aid attempts to evolve accurate models of biol ogical 
networks, which catalyze medical and biological research0^3^. 
The functional modularity generated also makes synthetically 
evolved networks easier to understand. These results will thus 
generate immediate benefits in many fields of applied engineer- 
ing, in addition to furthering our quest to explain one of nature's 
predominant organizing principles. 

Methods 

Experimental Parameters Each treatment is repeated 50 times 
with different stochastic events (i.e. different random number 
generator seeds). Analyses and visualizations are of the highest- 
performing network per trial with ties broken randomly. The 
main experiments (retina, non-modular retina, 5-XOR, and hier- 
archical XOR) last 25000 generations and have a population size 
of 1000. 

Statistics For each statistic we report the median +/- 95% boot- 
strapped confidence intervals of the median (calculated by resam- 
pling the data 5,000 times). In plots, these confidence intervals are 
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Fig. 5. Results from tests with different environmental problems. (A) Even on a non-modular problem, modularity is higher with P&CC, though it 
is lower than for modular problems. (B, C) P&CC performs better, is more modular, and has better functional decomposition than PA when evolving 
networks to solve five separate XOR functions and hierarchically nested XOR functions. The examples are the final, highest-performing networks per 
treatment. Figs. ^ ^ and ^show networks from all trials. 



smoothed with a median filter (window size=200) to remove sub- 
sampling noise. 

Length Cost For P&CC treatments, prior to calculating connec- 
tion costs, we place nodes in positions optimal for minimizing 
connection costs given the topology of the network, which is bi- 
ologically motivatedHmiland can be solved for mathematicall}/^^. 
Inputs and outputs are at fixed locations (SI Text). Visualizations 
reflect these node placements. 

Evoivabllity Experiment The evolvability experiments (Fig. 
|4j\) are described in Fig. ^ To obtain 50 trials that each had 
a perfectly performing network in L-AND-R and L-OR-R, respec- 
tively took 110 and 116 trials for P&CC and 320 and 364 trials 
for PA. 1000 clones of each of these networks then evolved in the 
alternate environment until performance was perfect or 5000 gen- 
erations passed. 

Overview of Network Models This section provides a brief 
overview of the network models in this paper. A more complete 
review of network models is provided in Alon 2007 ^, Haykin 
1998^, Newman 2003 3^, Barabasi 2004 and cites therein. 

Network models can represent many types of biological pro- 
cesses by representing interactions between components ^ ^ . Ex- 
ample biological systems that are commonly modeled as net- 
works are neural, genetic, metabolic, and protein interaction net- 
works. All such networks can be represented abstractly as nodes 
representing components, such as neurons or genes, and the in- 
teractions between such components, such as a gene inhibiting 
another or gene. The weight of connections indicates the type 
and strength of interactions, with positive values indicating acti- 
vation, negative values indicating inhibition, and the magnitude 
of the value representing the strength of the interaction. 

Multiple nodes can connect to form a network (e.g. Fig. ^^)- 
Typically, information flows into the network via input nodes, 
passes through hidden nodes, and exits via output nodes. Examples 
include a gene regulatory network responding to changing levels 
of environmental chemicals or a neural network responding to 
visual inputs from the retina and outputting muscle commands. 

Our model of a network is a standard, basic one used in ma- 
chine learning's, systems biolog}/^, and computational neuro- 



science It also has been used in previous, landmark stud- 
ies on the evolution of modularit}/'^^^ , The networks are feed- 
forward, meaning that nodes are arranged into layers, such that a 
node in layer n receives incoming connections only from nodes 
in layer n-i and has outgoing connections only to nodes in layer 
n+i (Fig. ^l]^). The maximum number of nodes per hidden layer 
is 8/4/2 for the three hidden layers in the retina problem, 8 for 
the single hidden layer in the 5-XOR problem, and 8/4/4 for 
the three hidden layers in the hierarchical XOR problem. The 
possible values for connection weights are the integers -2,-1,1, 
and 2. The possible values for thresholds (also called biases) are 
the integers -2,-1,0,1, and 2. Information flows through the net- 
work in discrete time steps one layer at a time. The output of 
each node in the network is the following function of its inputs: 

tanh^A (^.^j , '^ijVi+f'^^ where yj is the output of node j, i is 

the set of nodes connected to 7, IS the strength of the connec- 
tion between node i and node j, yi is the output of node i, and h 
is a threshold (also called a bias) that determines at which input 
value the output transitions from negative to positive. The tanh(x) 
transfer function ensures an output range of [-1,1]. a (here, 20) de- 
termines the slope of the transition between these inhibitory and 
excitatory extremes (Fig. ^l]:). This network model can approx- 
imate any function with an arbitrary precision provided that it 
contains enough hidden nodesi^ 

Evolutionary Algorithm The evolutionary algorithm is based 
on research into algorithms inspired by evolution that simulta- 
neously optimize several objectives, called multi-objective algo- 
rithms'^^. These algorithms search for, but are not guaranteed to 
find, the set of optimal trade-offs: i.e., solutions that cannot be 
improved with respect to one objective without decreasing their 
score with respect to another one. Such solutions are said to be 
on the Pareto Fronl^, described formally below. These algorithms 
are more general than algorithms that combine multiple objec- 
tives into a single, weighted fitness function, because the latter 
necessarily select one set of weights for each objective, whereas 
multi-objective algorithms explore all possible tradeoffs between 
objectives'^. 

The specific multi-objective algorithm in this paper is the 
widely used Nondominated Sorting Genetic Algorithm, version 
II (NSGA-II)i4 (Fig. ^). As with most modern multi-objective 
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evolutionary algorithms, it relies on the concept of Pareto domi- 
nance, defined as follows: 

An individual x* is said to dominate another individual x, if 
both conditions 1 and 2 are true: (1) x* is not worse than x with 
respect to any objective; (2) x* is strictly better than x with respect 
to at least one objective. 

However, this definition puts the same emphasis on all objec- 
tives. In the present study, we take into account that the first ob- 
jective (performance) is more important than the second objective 
(optimizing connection cost). To reflect this, we use a stochastic 
version of Pareto dominance in which the second objective is only 
taken into account with a given probability p. Lower values of p 
cause lower selection pressure on the second objective. Our re- 
sults are robust to alternate values of p, including up to p=i.o for 
static environments (Fig. ^6^-d) and p=o.95 for environments with 
modularly varying goals (Fig. ^6^). 

This stochastic application of the second objective is imple- 
mented as follows. Let r denote a random number in [o;i] and 
p the probability to take the second objective into account. A so- 
lution X* is said to stochastically dominate another solution x, if 
one of the two following conditions is true: (1) r>p and x* is bet- 
ter than X with respect to the first objective; (2) r<p and x* is not 
worse than x with respect to either objective and x* is better than 
x with respect to at least one objective. 

Stochastic Pareto dominance is used in the algorithm twice 
(Fig. ^1^): (1) To select a parent for the next generation, two in- 
dividuals XI and X2 are randomly chosen from the current popu- 
lation; if XI stochastically dominates X2, then xi is selected, if X2 
stochastically dominates xi, then is selected. If neither domi- 
nates the other, the individual selected is the one which is in the 
less crowded part of the objective spac^5 (2) To rank individu- 
als, the set of stochastically non-dominated solutions is first iden- 
tified and called the first Pareto layer (rank=i, e.g. h in Fig. ^1^); 
these individuals are then removed and the same operation is per- 
formed to identify the subsequent layers (additional ranks corre- 
sponding to h, h, etc. in Fig. ^1^). 

Mutational Effects Mutations operate in essentially the same 
way as in the study by Kashtan and Alor22. In each generation, 
every new network is randomly mutated (Fig. S 1^). Four kinds 
of mutation are possible, which are not mutually exclusive: (1) 
each network has a 20% chance of having a single connection 
added. Connections are added between two randomly chosen 
nodes that are not already connected and belong to two consecu- 
tive layers (to maintain the feed-forward property described pre- 
viously); (2) each network has a 20% chance of a single, randomly 
chosen connection being removed; (3) each node in the network 
has a 1 /24=4. 16% chance of having its threshold (also called its bias) 
incremented or decremented, with both options equally likely; 
five values are available (-2,-1,0,1,2); mutations that produce val- 
ues higher or lower than these five values are ignored; (4) each 
connection in the network has a separate probability of being in- 
cremented or decremented of 2.0/n, where n is the total number of 
connections of the network. Four values are available (-2,-1,1,2); 
mutations that produce values higher or lower than these four 
values are ignored. 

The results in this manuscript are robust to varying these pa- 
rameters. Because having more mutational events that remove 
connections than add them might also produce sparsely con- 
nected, modular networks, we repeated the main experiment 
with mutation rates biased to varying degrees (Fig. ^,b). These 
experiments show that even having remove-connection events be 
an order of magnitude more likely than add-connection events 
does not produce modular networks. In each of the experiments 
with biased mutation rates, the modularity and performance of 
the P&CC treatment with default mutation values was signifi- 
cantly greater than the PA treatment with biased mutation rate 
values (Fig. d9k,b). 



Our main results are qualitatively the same when weights and 
biases are real numbers (instead of integers) and mutated via 
Gaussian perturbation. Nodes are never added nor removed. For 
clarity, following nodes without any connections are not dis- 
played or included in results. 

Multi-Objective Landscape Exploration (MOLE) Algorithm 

It would be helpful to visualize the constraints, tradeoffs, and 
other correlations between different phenotypic dimensions that 
occur in evolving systems (e.g. in this study the performance, 
modularity, and length for each possible network). Because it is 
impossible for this study to determine such information via ex- 
haustive or random search (SI Text), we designed an algorithm 
to find high-performing solutions with different combinations 
of modularity and length scores. We call this algorithm Multi- 
Objective Landscape Exploration (MOLE). It is a multi-objective 
optimization search^^ (Fig. dlfe) with two objectives. The first ob- 
jective prioritizes individuals that have high performance. The 
second objective prioritizes individuals that are far away from 
other individuals already discovered by the algorithm, where dis- 
tance is measured in a Cartesian space with connection costs on 
the X axis and modularity on the y axis. Algorithms of this type 
have been shown to better explore the search space because they 
are less susceptible to getting stuck in local optima^^. Thus, unlike 
a traditional evolutionary algorithm that will only be drawn to a 
type of solution if there is a fitness gradient towards that type of 
solution, MOLE searches for high-performing solutions for every 
possible combination of modularity and cost scores. While this 
algorithm is not guaranteed to find the optimal solution at each 
point in the space, it provides a focused statistical sampling of 
how likely it is to discover a high-quality solution in each area of 
the search space. The MOLE maps in this paper show the highest 
performing network at each point in this Cartesian space found 
in 30 separate runs. 

Exhaustive and random search fail to discover high- 
performing networks In this study it would help to know the 
performance, modularity, and length for each solution in the 
search space. If the search space is small enough, such values 
can be determined by exhaustively checking every possible so- 
lution. Such an approach is intractable for the problems in this 
manuscript. For example, for the main problem in the paper, 
which is the retina problem, the number of possible weights is 

NumberO /Weights = (8x8) + (8x4) + (4x2) + (2x l)+23 = 129 

(1) 

due to the maximum possible number of nodes in the input, hid- 
den, and output layers, as well as the bias for each of the 23 pos- 
sible nodes. Each of these weights can be one of four values or a 
zero if no connection exists, and biases can be one of five values, 
leading to a search space of 

NumberOf Solutions = 5^^^ = 10^°. (2) 

Given that it takes on average 0.0013 seconds to assess the fit- 
ness of a solution across all possible 256 inputs using a modern 
computer, it would take 4.ixio^^ years of computing time to ex- 
haustively determine the performance, modularity, and cost for 
each solution in the search space. 

Because it is infeasible to exhaustively search the space, we 
randomly sampled it to see if we would find high-performing 
solutions with a variety of cost and modularity scores. Specifi- 
cally, we randomly generated more than two billion solutions, but 
every solution had poor performance. The highest-performing 
solution gave the correct answer for only 62 out of 256 retina 
patterns (24.2%), which is far below the performance of 93% or 
greater for solutions routinely discovered by the evolutionary al- 
gorithm (Fig. [2]:). We thus concluded that randomly sampling 
the space would not lead to the discovery of high-performing so- 
lutions. 
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Geometric coordinates Nodes exist at two-dimensional (i.e. 
x,y) Cartesian locations. The geometric coordinates of the inputs 
and outputs for all problems were fixed throughout evolution, 
including the treatment where the within-row location of inputs 
are randomized at the beginning of each separate trial, and are as 
follows. The inputs for all problems have y values of o. For the 
retina problem, the x values for the inputs are {-3.5,-2.5,.. .,3.5} and 
the output is at {4,0}. For the problem with 5 XOR modules, the x 
values for the inputs are {-4.5,-3.5,. ..,4.5} and the outputs all have 
y values of 2 with x values of {-4,-2,0,2,4}. For the problem with 
decomposable, hierarchically nested XOR functions, the x values 
for the inputs are {-3.5,-2.5,. ..,3.5} and the outputs all have y val- 
ues of 4 with X values of {-2,2}. The geometric location of nodes 
is consequential only when there is a cost for longer connections 
(i.e. the main P&CC treatment). 

Video of networks from each treatment evolving across gen- 
erations A video is provided to illustrate the change in net- 
works across evolutionary time for both the PA and P&CC treat- 
ments. It can be viewed at http: / / chronos.isir.upmc.fr/~mouret/tmp/ 
Imodularity. avi 1 In it, the networks are visualized as described in the 
text. 
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Fig. S1. Overview of the model. (A) The multi-objective evolutionary algorithm (NSGA-II). Starting with a population of randomly generated 
individuals, an offspring population of n new individuals is generated using the best individuals of the current population. The union of the offspring 
and the current population is then ranked according to the stochastic Pareto dominance (explained in Methods, here represented by having organisms 
in different ranks connected by lines labeled Li, L2, etc.) and the best n individuals form the next generation. (B) An example network model. 
Networks can model many different biological processes, such as genetic regulatory networks, neural networks, metabolic networks, and protein 
interaction networks. Information enters the network when it is sensed as an input pattern. Nodes, which represent components of the network (e.g. 
neurons or genes), respond to such information and can activate or inhibit other components of the network to varying degrees. The strength of 
interactions between two nodes is represented by the weight of the connection between them, which is a scalar value, and whether the interaction is 
excitatory or inhibitory depends on whether the weight is positive or negative. In this figure all non-zero weights are represented by an arrow. The 
signal entering each node is passed through a transfer function to determine the output for that node. That output then travels through each of the 
node's outgoing connections and, after being scaled by the weight of that outgoing connection, serves as a component of the incoming signal for 
the node at the end of that connection. Eventually an output pattern is produced, which for a neural network could be muscle commands or for a 
genetic regulatory network could be proteins. (C) The transfer function for each node. The sum of the incoming signal {x) for a node is multiplied by 20 
before being passed through the transfer function tanh(x). Multiplying by 20 makes the transition steep and similar to a step-like function. The tanh(x) 
function ensures that the output is in the range [-i,i]. 
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Fig. S2. Network visualizations and data from the main experiment. The highest-performing network at the end of each trial is visualized here, 
sorted first by fitness and second by modularity. For each network the performance (left-justified) and modularity score q (right-justified) are shown. 
An "L' or "R" indicates that the network has a single neuron that perfectly solves the left or the right subproblem, respectively. (A) Networks from the 
treatment where the only selection pressure is on the performance of the network. (B) Networks from the treatment where there are two selection 
pressures: one to maximize performance and one to minimize connection costs. 
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Fig. S3. Network visualizations and data for the experiment with five decomposable XOR Boolean logic functions. The highest-performing 
network at the end of each trial, sorted by fitness then modularity, and its performance (left-justified) and modularity score q (right-justified). (A) The 
environmental challenge in this experiment. (B and C) Networks from the PA and P-IVICC treatments, respectively. 
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Fig. S4. Network visualizations and data for the experiment with decomposable, hierarchically nested XOR Boolean logic functions. 

The highest-performing network at the end of each trial is visualized here, sorted first by fitness and second by modularity. For each network the 
performance (left-justified) and modularity score q (right-justified) are shown. (A) A depiction of the environmental challenge in this experiment. (B) 
Networks from the treatment where the only selection pressure is on the performance of the network. (0) Networks from the treatment where there 
are two selection pressures: one to maximize performance and one to minimize connection costs. 
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Fig. S5. Network visualizations and data from the experiment with a non-modular problem. (A) A non modular version of the retina problem 
is created when networks have to answer whether any four pixels are on, which is non-modular because it involves information from anywhere in the 
retina. (B, C) The highest-performing network at the end of each trial is visualized here, sorted first by fitness and second by modularity. For each 
network the performance (left-justified) and modularity score q (right-justified) are shown. (B) shows networks from the treatment where the only 
selection pressure is on the performance of the network. (C) shows networks from the treatment where there are two selection pressures: one to 
maximize performance and one to minimize connection costs. 
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Fig. S6. Modularity and performance results are robust to variation in the strength of selection to minimize connection costs. (A) Median 
performance per generation of the highest-performing network, which is improved by adding a selection pressure to minimize connection costs. The 
value of p indicates the strength of the pressure (IVIethods), ranging from 0% (no pressure) through 10% (low pressure) to 100% (high pressure- 
equal to that for the main performance objective). (B) Performance of the highest-performing networks after 25000 generations for different values of 
p. The differences between PA (p= 0%) and P&CC {p> 0%) are highly significant (four asterisks indicate a p-value <o.ooi). (C) Median modularity 
per generation of the highest-performing network of each trial for different values of p. (D) Modularity of the highest-performing networks after 25000 
generations, which is significantly higher when there is a selection pressure to reduce connection costs (i.e. when p ^ 0%). (E) A changing environment. 
The cc-axis represents the number of generations between a switch from the L-AND-R environment to the L-OR-R environment or vice-versa. This 
figure is the same as Fig.[4]from the main text, but shows results from additional values of p. The results for P&CC (piQ) are consistent across 
different values of p provided that the strength of selection on network connection costs is not equal to that of network performance (i.e. for all p \ 
100%). The p=100% line is anomalous because in that case cost is as important as performance, leading to mostly low-performing networks with 
pathological topologies: this effect is strongest when the environment changes quickly because such changes frequently eliminate the fitness benefits 
of high performance. 
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Fig. S7. Environmental changes between two modular problems with shared subproblems. (A) In the first evolutionary phase, networks evolve 
to answer whether a left and right object are present (the L-AND-R environment). The problem can be modularly decomposed, since the presence of 
left and right objects can first be separately detected before answering the larger question of whether they are both present. A left object is considered 
present if one of the eight left object patterns (shown in Fig.|2]in the main text) is exposed to the left side of the retina, and vice-versa for right objects. 
Networks that could perfectly solve the L-AND-R problem after 5000 generations are then transferred to an environment in which the challenge is to 
determine if a left or right object is present (the L-OR-R environment), which is a different overall problem that shares the subproblems of identifying 
left and right objects. Evolution continues until networks can perfectly solve the L-OR-R problem or until 5000 generations elapse. (B) The same 
experiment is repeated, but first evolving in the L-OR-R environment and then transferring to the L-AND-R environment. (C) A fully zoomed-out plot of 
Fig.[4|from the main text. Trials last 5000 generations or until a network exhibits perfect performance in the new environment. For each treatment, for 
each of 50 networks that perform perfectly in the first environment, we conduct 50 trials in the second environment, meaning that each column in this 
figure (and Fig.|4]from the main text) represent 50x50 = 2500 data points. 
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Fig. S8. (A) Repeating the main experiment wherein selection is based on performance and connection costs (P&CC), but with connection costs 
solely measured as the number of connections (P&CC-NC), produces networks that look qualitatively similar to those produced with the default P&CC 
cost function (compare the networks visualized here to those in Fig. The modularity and performance scores are also qualitatively similar and not 
significantly different (modularity q=o.36[o.22,o.4] vs. default cost function q=o.42[o.25,o.45],p=o.i5; performance =i.o[o. 99,1.0] vs. default cost function 
performance of i.o[i.o,i.o],p=i.o). Like the default cost function, the alternative cost function produced significantly more modularity (p=ixio-^) 
and higher performance (p=ixio-^) than PA. Also qualitatively similar between these cost functions is the percent of runs that have the inputs 
related to the left and right subproblems in separate modules after splitting the network once to maximize modularity (alternate: 52%, default: 56%, 
Fisher's exact test: p=o.84). One minor qualitative difference is the percent of runs that have two neurons that each perfectly solve the left and right 
subproblems, respectively (alternate: 10%, default: 39%, Fisher's exact test: p=8xio-^). (B) Randomizing the geometric location of input coordinates 
(P&CC-RL) eliminates the left-right geometric decomposition of the L-AND-R retina problem, yet such randomization does not change the result that 
a connection cost causes significantly higher performance and modularity. (C) Displayed are final, evolved P&CC-RL networks, which are shown both 
with randomized input coordinates (left) and with the left and right inputs in order to visualize problem decomposition (right). Despite the randomization 
of the input locations, many of the evolved modules still functionally decompose the left and right subproblems. 
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Fig. S9. (A,B) Biasing the mutation rate towards having more remove-connection mutations than add-connection mutations does not 
increase the modularity or performance of PA treatments. Setting the remove-connection mutation rate to be up to an order of magnitude higher 
than the add-connection mutation rate did not qualitatively change the results. The PA treatment with default values of 0.2 for the remove-connection 
and add-connection mutation rates did not have statistically different levels of modularity or performance from PA treatments with the same remove- 
connection mutation rate of 0.2, but a lower add-connection mutation rate of 0.15, 0.10, or 0.02 (p>o.o5). The P&CC treatment had statistically higher 
performance and modularity scores than every PA treatment, irrespective its mutation rate bias. (C,D,E) Networks evolved with a connection cost 
have shorter lengths, fewer nodes, and fewer connections. (C) The summed length of all connections is smaller for networks evolved in the P&CC 
regime than in the PA regime. The P&CC networks also have fewer nodes (D) and connections (E). Three asterisks indicate a value of p<o.ooi and 
four indicate a value of p<o.oooi. 
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Fig. S10. The highest-performing networks found for each combination of modularity and length for (A) the problem with five separable 
XOR functions and (B) the hierarchical XOR problem. To understand the performance potential of networks with different combinations of modular- 
ity levels and summed connection lengths we invented the IVIulti-Objective Landscape Exploration (IVIOLE) algorithm, which is described in IVIethods. 
Colors indicate the highest-performing network found at that point in the modularity vs. cost space, with yellow representing perfect performance. The 
best-performing network at the end of each of the 50 PA and P&CC runs are overlaid on the map. Networks with perfect performance exist through- 
out the space, which helps explain why modularity does not evolve when there is selection based on performance alone. There is also an inverse 
correlation between length and modularity for high-performing networks: The lowest cost networks-those with the shortest summed lengths-that are 
high-performing are modular. 
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